
# GlobalDiplomacyNet Dataset

## General Information

GlobalDiplomacyNet is a global dataset of diplomatic news and images compiled from the official webpages of ministries of foreign affairs and chief executive offices, covering more than 150 countries and 250 websites. It includes over __1.15 million news articles__ and __1.18 million images__, along with extracted entities and information about the people depicted in the images.

### Full Author List
- Nihat Mugurtay
- Kaan Guray Sirin 
- Mehrdad Heshmat Najafabad
- Ahmet Taha Kahya
- Fazli Goktug Yilmaz 
- Yasser Zouzou
- Batuhan Bahceci 
- Ayca Demir 
- Dogukan Tosun
- Meltem Müftüler-Baç
- Onur Varol

### Author/Principal Investigator Information
* Name: Nihat Muğurtay
* Institution: Sabancı University
* Email: nmugurtay@sabanciuniv.edu
<!-- ORCID: -->



* Date of data collection: Data collection concluded on **2024-12-31**, with the dataset current as of that date.
* Funding: This work is supported by TUBITAK under the grant agreement 223K173. We also thank TUBITAK 121C220 for their partial support.



## Data Overview

### Folder Structure

Folders follow a fixed naming convention, and all data for a given website is placed in its folder. Each folder name is composed of up to three components separated by underscores (`_`):
- **`ISO-3166-COUNTRY-CODE`** — The three-letter ISO 3166 country code 
- **`TYPE`** — Indicates the type of website:
  - `mofa` → Ministry of Foreign Affairs  
  - `exec` → Chief Executive 
- **`COUNT`** _(optional)_ — A sequence number used when multiple versions of the same country and type exist
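
The naming convention above can be parsed programmatically. The following is a minimal sketch, assuming that country codes are three uppercase letters and that `COUNT` is a plain integer (as in `FRA_exec_2`); the names `FolderInfo` and `parse_folder_name` are hypothetical helpers, not part of the dataset.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern for names such as "USA_mofa" or "FRA_exec_2".
FOLDER_RE = re.compile(
    r"^(?P<country>[A-Z]{3})_(?P<type>mofa|exec)(?:_(?P<count>\d+))?$"
)


class FolderInfo(NamedTuple):
    country: str          # three-letter country code
    site_type: str        # "mofa" or "exec"
    count: Optional[int]  # sequence number, or None when absent


def parse_folder_name(name: str) -> FolderInfo:
    """Split a folder name into its country, type, and optional count."""
    match = FOLDER_RE.match(name)
    if match is None:
        raise ValueError(f"Unexpected folder name: {name!r}")
    count = match.group("count")
    return FolderInfo(
        match.group("country"),
        match.group("type"),
        int(count) if count else None,
    )


print(parse_folder_name("USA_mofa"))
print(parse_folder_name("FRA_exec_2"))
```

Raising `ValueError` on unexpected names makes it easy to notice folders that do not follow the convention, for example when iterating over the dataset root.
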

### File Structure

Every country folder has two files:
- **`news.jsonl`**: Parsed news content stored in [JSON Lines](https://jsonlines.org/) format. Each line represents one article and contains the following fields:

  ```json
  {
    "id": UUID for the news article,
    "url": News URL,
    "date": Published date in ISO_8601 format,
    "title": Title of the news article,
    "content": Content in English,
    "lang": Language of the content in ISO_639-1,
    "title_original": Original title if it is not in English,
    "content_original": Original content if it is not in English,
    "entities": {
      "persons": [Detected Person Entities],
      "countries": [Detected Country Entities],
      "organizations": [Detected International Organization Entities]
    },
    "wikidata_qids": {
      "persons": [Wikidata QIDs for person entities],
      "countries": [Wikidata QIDs for country entities],
      "organizations": [Wikidata QIDs for organization entities]
    }
  }
  ```
- **`images.jsonl`**: Each line represents one image along with some parsed information:
  ```json
  {
    "id": UUID for the image,
    "news-id": UUID for the news article from news.jsonl,
    "url": URL for the image,
    "male-count": male count in the image,
    "female-count": female count in the image
  }
  ```
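
Because `news-id` in `images.jsonl` refers to `id` in `news.jsonl`, the two files can be joined per folder. The sketch below uses only the Python standard library and the field names documented above; the helper names `read_jsonl` and `images_by_article`, and the local path in the usage part, are assumptions made for illustration.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterator, List


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def images_by_article(folder: Path) -> Dict[str, List[dict]]:
    """Group image records by the "news-id" of the article they belong to."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for image in read_jsonl(folder / "images.jsonl"):
        grouped[image["news-id"]].append(image)
    return grouped


# Usage: print each article title with its number of images.
folder = Path("GlobalDiplomacyNet/USA_mofa")  # hypothetical local path
if folder.is_dir():
    images = images_by_article(folder)
    for article in read_jsonl(folder / "news.jsonl"):
        print(article["title"], len(images.get(article["id"], [])))
```

Reading line by line keeps memory use low for `news.jsonl`, which can be large; only the image records are held in memory for the join.
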
### Summary Statistics
A summary file, `statistics.xlsx`, is also provided. For each website, it lists:
- Country ISO-3166 code  
- Website type (`mofa` or `exec`)  
- Total number of news articles  
- Time span covered by the data (in years) 
- Median content length (in characters)  
- Average number of images per news article  
- Percentage of news items that were translated  
- Source website host (URL domain)


### Example Tree View
```text
GlobalDiplomacyNet/
├── USA_mofa/
│  ├── news.jsonl
│  └── images.jsonl
├── FRA_exec_2/
│  ├── news.jsonl
│  └── images.jsonl
├── ...
│
├── README.md
└── statistics.xlsx
```
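
Given the layout shown in the tree view, a quick sanity check is to count the articles in every source folder. This is a rough sketch, assuming the dataset has been extracted to a local directory named `GlobalDiplomacyNet`; the function name `count_articles` is hypothetical.

```python
from pathlib import Path
from typing import Dict


def count_articles(root: Path) -> Dict[str, int]:
    """Map each source folder name to the number of non-empty lines in its news.jsonl."""
    counts: Dict[str, int] = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        news_file = folder / "news.jsonl"
        if not news_file.is_file():
            continue  # skip anything that is not a source folder
        with open(news_file, encoding="utf-8") as handle:
            counts[folder.name] = sum(1 for line in handle if line.strip())
    return counts


root = Path("GlobalDiplomacyNet")  # hypothetical local path
if root.is_dir():
    for name, n in count_articles(root).items():
        print(f"{name}: {n}")
```

The resulting counts can be compared against the article totals listed in `statistics.xlsx`.
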
## Methodological Information

### Data Collection
The data was collected through web scraping. Examples of the scraping and parsing code can be found in the project's [GitHub repository](https://github.com/ViralLab/GlobalDiplomacyNet-Dataset.git).



## Usage/Access Notes

The dataset is freely available for research and non-commercial use. Users are encouraged to cite this work when using the dataset in publications or projects.

```bibtex
@article{,
  title={},
  author={},
  journal={},
  year={},
  url={}
}
```
